A Concept-Based Framework for Passage Retrieval at Genomics
Abstract
The task of the TREC 2006 Genomics Track is to retrieve passages (from part of a paragraph up to a whole paragraph) from full-text HTML biomedical journal papers in order to answer structured questions from real biologists. A system for this task must be able to parse the HTML (converting it into plain text) and pinpoint the most relevant passage(s) within documents for a given question. Our system accomplishes the task in three steps. The first step parses the HTML articles and partitions them into paragraphs. The second step retrieves the relevant paragraphs. The third step identifies the most relevant passages within those paragraphs and ranks them. We are interested in three questions:
1. How does a concept-based IR model perform on structured queries compared to Okapi?
2. Does query expansion based on domain knowledge increase retrieval effectiveness?
3. Does our abbreviation database from MEDLINE help improve query expansion, and does abbreviation disambiguation help improve the ranking?
The experimental results show that our concept-based IR model works better than Okapi; that query expansion based on domain knowledge is important, especially for queries with very few relevant documents; and that an abbreviation database for query expansion and disambiguation is helpful for passage retrieval.

1. Step 1: HTML parsing and document partitioning

The HTML full-text journal articles use special symbols to represent special characters. For example, the character β can be written with a character entity that renders as “ß” (such as “&szlig;” or “&#223;”). Since characters of this kind are commonly used in gene names, it is important to translate them into plain text (e.g., replacing “ß” with “beta”). We used the ISO 8859-1 (Latin-1) character list for the translation. Sometimes GIF files are used to represent these characters. For example, an HTML tag referencing a GIF image is used in one of the articles to represent β.
Notice that the term “beta” is included inside the HTML tag, so special processing is needed in the tag handler of the HTML parser. Finally, step 1 partitions each document into paragraphs according to the HTML markup.

2. Step 2: Paragraph retrieval

This step retrieves the top 2,000 most relevant paragraphs, which are used as the input to step 3 for passage retrieval. Several techniques are applied in this step; they are explained in detail in the following sections.

2.1 Conditional Porter stemming

Stemming is employed to recognize variants of the same word, which also reduces the number of indexed terms. The Porter stemmer [1] is widely used in the IR community. To determine whether this stemmer is suitable for biomedical texts, we applied it to both the queries and the documents and examined two aspects:
1. How are the query words stemmed?
2. Which words become equal to the query words after they are stemmed?
We observed that the Porter stemmer is suitable in most cases, with two exceptions:
Case 1: A gene name in the query is changed into a totally different non-gene word after stemming. For example, the stemmer changes the gene “Pes” to “Pe”, and the gene “IDE” is stemmed into “ID”.
Case 2: A non-gene word becomes a gene name after stemming. For example, “IDEE” becomes the gene “IDE” after stemming.
Gene names in the queries are very important biomedical concepts, and the performance of an IR system can degrade severely if either case occurs. To use the Porter stemmer while avoiding both cases, we employed a conditional stemming strategy, under which whether a word is stemmed depends on both the original word and its potential stem. Given a word w, we classify w into the following 3 categories:
1. G: gene names.
This category is further divided into two sub-categories: G1, gene names ending with numbers (e.g., “HNF4”), and G2, gene names not ending with numbers (e.g., “APC”).
2. E: regular English words.
3. N: neither gene names nor English words.
w is a gene name if it is in the Entrez Gene database, and w is an English word if it is in the WordNet dictionary. A word can belong to both G and E; for example, “shot” is a common English word and also a gene name.

The conditional Porter stemming strategy is as follows. Suppose a word w would be stemmed into w' by the Porter stemmer. Stemming is skipped if any of the following conditions is satisfied:
1) w ∈ G
2) (w ∈ E) ∧ (w ∉ G) ∧ (w' ∈ G2)
3) (w ∈ N) ∧ (w' ∈ G2)
There is one exception to condition 3. We observed that the stemming “GSTMs” → “GSTM” is acceptable; here w ∈ N, w' ∈ G2, w' ends with an uppercase alphabetic character, and w = w' + “s”. This exception applies only to “s” because the plural of a gene name is usually formed by adding an “s” to the end of the gene name. Condition 1 guarantees that case 1 is avoided. Condition 2 ensures that case 2 does not happen for English words. Condition 3 avoids case 2 for words that are neither gene names nor English words.

2.2 Handling lexical variants of gene names

New gene names and their lexical variants are regularly introduced into the biomedical literature [2, 3]. However, many reference databases, such as UMLS and Entrez Gene (formerly named LocusLink), may not be able to keep track of all such variants. Previous work [4, 5] has demonstrated that expanding gene names with their lexical variants improves retrieval performance on biomedical literature. In our system, the lexical variants of a gene come from two sources:
1) automatically generated according to a usual strategy [4, 5] based on the features of the guidelines for human gene nomenclature [6];
2) retrieved from an abbreviation database created from MEDLINE.
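
As an aside on Section 2.1, the conditional stemming rule (conditions 1 to 3 plus the plural exception) can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the system's actual code: the function names, the toy stem table standing in for the Porter stemmer, and the small sets standing in for Entrez Gene and WordNet are all hypothetical.

```python
from typing import Callable, Set


def ends_with_digit(word: str) -> bool:
    """True for G1-style names such as "HNF4"."""
    return bool(word) and word[-1].isdigit()


def conditional_stem(
    word: str,
    stem: Callable[[str], str],
    genes: Set[str],
    english: Set[str],
) -> str:
    """Return the stem of `word`, or `word` itself when stemming is skipped."""
    stemmed = stem(word)
    if stemmed == word:
        return word

    in_g = word in genes
    in_e = word in english
    # w' is in G2 when it is a gene name that does not end with a number.
    stem_in_g2 = stemmed in genes and not ends_with_digit(stemmed)

    if in_g:  # condition 1: w is a gene name
        return word
    if in_e and stem_in_g2:  # condition 2: w in E, w not in G, w' in G2
        return word
    if not in_e and stem_in_g2:  # condition 3: w in N, w' in G2
        # Exception: plural of a gene name, e.g. "GSTMs" -> "GSTM".
        if word == stemmed + "s" and stemmed[-1].isupper():
            return stemmed
        return word
    return stemmed


# Hypothetical stand-ins for the Porter stemmer, Entrez Gene, and WordNet.
TOY_STEMS = {"Pes": "Pe", "IDE": "ID", "IDEE": "IDE", "GSTMs": "GSTM", "genes": "gene"}
GENES = {"Pes", "IDE", "GSTM", "APC", "HNF4"}
ENGLISH = {"genes", "shot"}


def toy_stem(word: str) -> str:
    return TOY_STEMS.get(word, word)


for w in ["Pes", "IDEE", "GSTMs", "genes"]:
    print(w, "->", conditional_stem(w, toy_stem, GENES, ENGLISH))
```

Passing the stemmer in as a function keeps the rule independent of any particular Porter implementation; a real stemmer and real dictionaries could be substituted for the toy ones.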
Here we only explain the second strategy. ADAM [7] is an abbreviation database that covers frequently used abbreviations and their definitions (or long-forms) within MEDLINE titles and abstracts, including both acronyms and non-acronym abbreviations. An important feature of ADAM is that morphologically similar abbreviations are clustered together. For example, “5HT” is an abbreviation for “5-Hydroxytryptamine”, and ADAM shows that “5-Hydroxytryptamine” can be abbreviated as “5-HT”, “5HT”, “5-ht”, “5-Ht”, and “5-H-T” in the literature. This feature of ADAM can be used to find extra lexical variants of gene names beyond those that are automatically generated.

A gene abbreviation can have a non-gene long-form. For example, “APC” can also be an abbreviation for “Air Pollution Control”. To find the lexical variants of a gene abbreviation, we first need to know the long-form of the gene. This is accomplished either by extracting the long-form from the queries themselves or by searching the Entrez Gene database. After the long-form of the gene abbreviation is identified, we search for it in ADAM. Suppose the gene abbreviation is “PRNP” and we identified “Prion protein gene” as its long-form from the Entrez Gene database. From ADAM, we find that “PRNP” and “Prion protein gene” form an abbreviation/long-form pair and that “Prion protein gene” can be shortened as “Prnp”, “prn-p”, or “prnp” (“PRNP”, “Prnp”, and “prnp” are the same after tokenization). Notice that “prn-p” would not be generated automatically by the first strategy.

2.3 Utilizing domain knowledge

Domain knowledge is critical for query expansion. A biomedical concept can be expressed in many different ways. For example, “high blood pressure” and “hypertension” refer to the same vascular disease, and “hypocretin-2 receptor” and “orexin B receptor” are two names for the same receptor. This phenomenon is very common in the biomedical literature.
Acquiring and using this kind of domain knowledge is very useful for retrieving more relevant documents. We define a concept as a biomedical meaning or sense [8]. We consider that 1) a gene and its synonym set refer to the same concept, and 2) a medical subject heading (MeSH) term and its synonym set refer to the same concept.

2.3.1 Identifying query concepts

A query usually contains several concepts. For example, the query “purification of rat IgM” has 3 concepts: “purification”, “rat”, and “IgM”. This section describes how to automatically identify the concepts in a query. In our system, a concept can be a gene name or a MeSH term, so we need to identify the gene names and MeSH terms in a query. In the queries of the TREC 2006 Genomics Track, the gene names are already specified by the query templates [9]. To extract the MeSH terms from a query, we use the PubMed Automatic Term Mapping [10]: we submit the whole query to PubMed, and PubMed returns a file in which the MeSH terms in the query are marked.

2.3.2 Retrieving concept information from biomedical thesaurus

For each MeSH term, the MeSH database gives its synonyms, hypernyms (more generic terms), and hyponyms (more specific terms). For each gene name, we retrieve its synonyms from the Entrez Gene database. We also collected 22,446 gene names from the UMLS (concepts that map to the "Gene or Genome" semantic type). Synonyms of these genes were retrieved from the UMLS.

2.3.3 Finding related concepts

Related concepts (which are not synonyms, hypernyms, or hyponyms) can be very useful in some cases. For example, suppose a query asks for information about “the gene HNF4 and COUP-tf I in the suppression in the function of the liver”. We observed that some relevant documents in the 2005 document collection discuss the role of “HNF4” and “COUP-tf I” in regulating “hepatitis B virus” transcription.
However, “hepatitis B virus” is known to be a virus that can cause serious damage to liver function. The relationships among these three concepts can be described as A ↔ B ↔ C, where “↔” indicates that the concepts on its two sides are related. In the above example, A is “HNF4” and “COUP-tf I”, B is “hepatitis B virus”, and C is “liver”. The query asks for A and C, but the relevant document is about A and B. Queries from the TREC Genomics Track were collected from real biologists, and some of them represent the information needs of their latest research, so very few relevant papers may exist in the literature. Related concepts can be very useful for this kind of query. This issue is closely related to the work in [11] on identifying implicit relationships between two disjoint literatures.

A semantic type is a category assigned to concepts based on their intrinsic and functional properties [12]. We require the related concept B to be related to A and C not only in free text (i.e., B co-occurs frequently with both A and C in the documents) but also on the semantic level (i.e., the semantic type of B interconnects with both the semantic type of A and the semantic type of C). For any concept X, let S(X) be the set of semantic types of X in the UMLS semantic network. Given two concepts A and C, Algorithm 1 describes a method of finding the related concepts B:

Algorithm 1 Finding related concepts
1. U ← the semantic types that are related to both T1 ∈ S(A) and T2 ∈ S(C)
2. B1 ← the concepts that co-occur with A in free text
3. B2 ← the concepts that co-occur with C in free text
4. B ← B1 ∩ B2
5. Filter B by removing every X ∈ B with S(X) ∩ U = ∅

The size of B can be very large. In practice we only need the concepts that are most related to A and C. For each X ∈ B, we assign a score that indicates the association of X with A and C. Given two concepts C1 and C2, we use the mutual information [13] to measure their association:
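
Separately from the scoring step, the five steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name, the three dictionary inputs, and the toy semantic-type labels and relations are hypothetical and are not taken from the UMLS. The sketch also reads step 1 as "related to at least one type in S(A) and at least one type in S(C)", which is one possible interpretation. The mutual-information scoring is not included.

```python
from typing import Dict, Set


def find_related_concepts(
    a: str,
    c: str,
    sem_types: Dict[str, Set[str]],      # concept -> S(X)
    related_types: Dict[str, Set[str]],  # semantic type -> related semantic types
    cooccur: Dict[str, Set[str]],        # concept -> concepts co-occurring in text
) -> Set[str]:
    """Return the concepts B that link A and C in text and at the semantic level."""
    # Step 1: U = types related to some type of A and to some type of C.
    near_a: Set[str] = set()
    for t in sem_types.get(a, set()):
        near_a |= related_types.get(t, set())
    near_c: Set[str] = set()
    for t in sem_types.get(c, set()):
        near_c |= related_types.get(t, set())
    u = near_a & near_c

    # Steps 2-4: concepts that co-occur with both A and C.
    b = cooccur.get(a, set()) & cooccur.get(c, set())

    # Step 5: drop every X whose semantic types do not intersect U.
    return {x for x in b if sem_types.get(x, set()) & u}


# Toy data (hypothetical labels and relations, for illustration only).
SEM_TYPES = {
    "HNF4": {"Gene"},
    "liver": {"Organ"},
    "hepatitis B virus": {"Virus"},
    "protein": {"Substance"},
}
RELATED_TYPES = {"Gene": {"Virus", "Substance"}, "Organ": {"Virus"}}
COOCCUR = {
    "HNF4": {"hepatitis B virus", "protein"},
    "liver": {"hepatitis B virus", "protein"},
}

print(find_related_concepts("HNF4", "liver", SEM_TYPES, RELATED_TYPES, COOCCUR))
```

In the toy data, “protein” co-occurs with both concepts but is filtered out in step 5 because its only semantic type is not related to both sides, while “hepatitis B virus” survives.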